Vision Transformer for Fast and Efficient Scene Text Recognition

Abstract

Scene text recognition (STR) enables computers to read text in natural scenes such as object labels, road signs and instructions. STR helps machines make informed decisions such as what object to pick, which direction to go, and what the next step of action is. In the body of work on STR, the focus has always been on recognition accuracy. There is little emphasis placed on speed and computational efficiency, which are equally important, especially for energy-constrained mobile machines. In this paper we propose ViTSTR, an STR with a simple single stage model architecture built on a compute and parameter efficient vision transformer (ViT). On a comparable strong baseline method such as TRBA with an accuracy of 84.3%, our small ViTSTR achieves a competitive accuracy of 82.6% (84.2% with data augmentation) at \(2.4\times \) speed up, using only 43.4% of the number of parameters and 42.2% FLOPS. The tiny version of ViTSTR achieves 80.3% accuracy (82.1% with data augmentation), at \(2.5\times \) the speed, requiring only 10.9% of the number of parameters and 11.9% FLOPS. With data augmentation, our base model outperforms TRBA at 85.2% accuracy (83.7% without augmentation) at \(2.3\times \) the speed but requires 73.2% more parameters and 61.5% more FLOPS. In terms of trade-offs, nearly all ViTSTR configurations are at or near the frontiers to maximize accuracy, speed and computational efficiency all at the same time.
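The trade-off claim can be checked roughly against the figures quoted in the abstract. The sketch below is only an illustration, not the authors' evaluation code: it treats TRBA as the 1.0 reference for speed, parameters and FLOPS, uses the accuracies without data augmentation, and the labels `ViTSTR-tiny`, `ViTSTR-small` and `ViTSTR-base` as well as the helper names are assumptions introduced here. A configuration is on the Pareto frontier when no other configuration is at least as good on every axis and strictly better on one.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str        # label chosen for this sketch, not taken from the paper
    accuracy: float  # percent, without data augmentation
    speed: float     # speed-up relative to TRBA (higher is better)
    params: float    # parameter count relative to TRBA (lower is better)
    flops: float     # FLOPS relative to TRBA (lower is better)


# Figures as quoted in the abstract; "73.2% more" is read as 1.732x TRBA.
CONFIGS = [
    Config("TRBA", 84.3, 1.0, 1.000, 1.000),
    Config("ViTSTR-tiny", 80.3, 2.5, 0.109, 0.119),
    Config("ViTSTR-small", 82.6, 2.4, 0.434, 0.422),
    Config("ViTSTR-base", 83.7, 2.3, 1.732, 1.615),
]


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on every axis and better on at least one."""
    no_worse = (a.accuracy >= b.accuracy and a.speed >= b.speed
                and a.params <= b.params and a.flops <= b.flops)
    better = (a.accuracy > b.accuracy or a.speed > b.speed
              or a.params < b.params or a.flops < b.flops)
    return no_worse and better


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


frontier_names = [c.name for c in pareto_frontier(CONFIGS)]
print(frontier_names)
```

With these rounded figures no model dominates another: each one gives up accuracy, speed or size relative to the rest, which is in line with the abstract's statement about the frontiers. The paper's own frontier analysis presumably rests on measured values rather than these relative ones.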

Similar resources

Improving text recognition by distinguishing scene and overlay text

Video texts are closely related to the content of a video. They provide a valuable source for indexing and interpretation of video data. Text detection and recognition tasks in images or videos typically distinguish between overlay and scene text. Overlay text is artificially superimposed on the image at the time of editing, whereas scene text is text captured by the recording system. Typically, OC...

Extended Spectral Regression for efficient scene recognition

This paper proposes a novel method based on Spectral Regression (SR) for efficient scene recognition. First, a new SR approach, called Extended Spectral Regression (ESR), is proposed to perform manifold learning on a huge number of data samples. Then, an efficient Bag-of-Words (BOW) based method is developed which employs ESR to encapsulate local visual features with their semantic, spatial, sc...

Computer Vision for Scene Text Analysis

Title of dissertation: Computer Vision for Scene Text Analysis. Ali Zandifar, Doctor of Philosophy, 2004. Dissertation directed by: Professor Rama Chellappa, Electrical Engineering Department. Co-Advisors: Dr. Ramani Duraiswami, Professor Larry S. Davis, Department of Computer Science. The motivation of this dissertation is to develop a ‘Seeing-Eye’ video-based interface for the visually impaired to a...

Scene Text Recognition and Retrieval for Large Lexicons

In this paper we propose a framework for recognition and retrieval tasks in the context of scene text images. In contrast to many of the recent works, we focus on the case where an image-specific list of words, known as the small lexicon setting, is unavailable. We present a conditional random field model defined on potential character locations and the interactions between them. Observing that...

Unified Detection and Recognition for Reading Text in Scene Images

Journal

Journal title: Lecture Notes in Computer Science

Year: 2021

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-030-86549-8_21